Abstract. Reinforcement learning is commonly used with function approximation.
However, very few positive results are known about the convergence of
function-approximation-based RL control algorithms. In this paper we show that
TD(0) and Sarsa(0) with linear function approximation are convergent for a
simple class of problems, where the system is linear and the costs are
quadratic (the LQ control problem). Furthermore, we show that for systems
with Gaussian noise and partially observable states (the LQG problem),
these RL algorithms remain convergent if they are combined with
Kalman filtering.



1. Introduction 

Reinforcement learning is commonly used with function approximation. However,
the technique has few theoretical performance guarantees: for example, it
has been shown that even linear function approximators (LFA) can diverge with
such frequently used algorithms as Q-learning or value iteration [3d]. There are positive
results as well: it has been shown [10][7][9] that TD($\lambda$), Sarsa, and importance-sampled
Q-learning are convergent with LFA if the policy remains constant (policy
evaluation). However, to the best of our knowledge, the only result about the
control problem (when we try to find the optimal policy) is that of Gordon [4],
who proved that TD(0) and Sarsa(0) cannot diverge (although they may oscillate
around the optimum, as shown in [3]).

In this paper, we show that RL control with linear function approximation can
be convergent when it is applied to a linear system with quadratic cost functions
(known as the LQ control problem). Using the techniques of Gordon [4], we
prove that under appropriate conditions, TD(0) and Sarsa(0) converge to the
optimal value function. As a consequence, Kalman filtering with RL is convergent for
observable systems, too.

Although the LQ control task may seem simple, and there are numerous other
methods for solving it, we think that this Technical Report has some significance: (i)
To the best of our knowledge, this is the first paper showing the convergence of an RL
control algorithm using LFA. (ii) Many problems can be translated into LQ form



*Last updated: 22 October 2006. 

¹There are results for policy iteration (e.g. [5]). However, by construction, policy iteration
could be very slow in practice.

2. The LQ control problem

Consider a linear dynamical system with state $x_t \in \mathbb{R}^n$ and control $u_t \in \mathbb{R}^m$, in
discrete time $t$:

(1) $x_{t+1} = F x_t + G u_t.$

Executing control step $u_t$ in $x_t$ costs

(2) $c(x_t, u_t) := x_t^T Q x_t + u_t^T R u_t,$

and after the $N$th step the controller halts and receives a final cost of $x_N^T Q^f x_N$.
The task is to find a control sequence with minimum total cost.

First of all, we slightly modify the problem: the run time of the controller
will not be a fixed number $N$. Instead, after each time step, the process will be
stopped with some fixed probability $p$ (and then the controller incurs the final cost
$c_f(x_f) := x_f^T Q^f x_f$). This modification is commonly used in the RL literature; it
makes the problem more amenable to mathematical treatment.

2.1. The cost-to-go function. Let $V_t^*(x)$ be the optimal cost-to-go function at
time step $t$, i.e.

(3) $V_t^*(x) := \inf_{u_t, u_{t+1}, \ldots} E\left[ c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \ldots + c_f(x_f) \mid x_t = x \right].$

Considering that the controller is stopped with probability $p$, Eq. (3) assumes the
following form

(4) $V_t^*(x) = p \cdot c_f(x) + (1 - p) \inf_u \left( c(x, u) + V_{t+1}^*(Fx + Gu) \right)$

for any state $x$. It is easy to show that the optimal cost-to-go function is
time-independent and that it is a quadratic function of $x$. That is, the optimal
cost-to-go function assumes the form

(5) $V^*(x) = x^T \Pi^* x.$

Our task is to estimate the optimal value function (i.e., the parameter matrix $\Pi^*$)
on-line. This can be done by the method of temporal differences.
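
As a rough numerical illustration of Eqs. (4) and (5): substituting $V(x) = x^T \Pi x$ into Eq. (4) and minimizing over $u$ gives a Riccati-type fixed-point equation for $\Pi^*$, which can simply be iterated when the model is known. The sketch below is not part of the original report; the function names `bellman_backup` and `solve_pi_star` and the toy matrices are assumptions made only for illustration.

```python
import numpy as np

def bellman_backup(Pi, F, G, Q, R, Qf, p):
    """Apply the right-hand side of Eq. (4) to V(x) = x^T Pi x.

    The inner minimization over u has the closed form
    x^T (Q + F^T Pi F - F^T Pi G (R + G^T Pi G)^{-1} G^T Pi F) x.
    """
    K = np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)
    inner = Q + F.T @ Pi @ F - F.T @ Pi @ G @ K
    return p * Qf + (1.0 - p) * inner

def solve_pi_star(F, G, Q, R, Qf, p, iters=2000):
    """Iterate the backup from Pi = Qf (a model-based reference, not TD learning)."""
    Pi = Qf.copy()
    for _ in range(iters):
        Pi = bellman_backup(Pi, F, G, Q, R, Qf, p)
    return Pi

# Hypothetical toy problem: n = 2 states, m = 1 control.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
Q, R, Qf, p = np.eye(2), np.eye(1), np.eye(2), 0.1

Pi_star = solve_pi_star(F, G, Q, R, Qf, p)
residual = np.max(np.abs(Pi_star - bellman_backup(Pi_star, F, G, Q, R, Qf, p)))
```

Such a model-based solution is useful mainly as a reference value against which an on-line estimate can be compared.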

We start with an arbitrary initial cost-to-go function $V_0(x) = x^T \Pi_0 x$. After
this,

(1) control actions are selected according to the current value function estimate,

(2) the value function is updated according to the experience, and

(3) these two steps are iterated.

The $t$th estimate of $V^*$ is $V_t(x) = x^T \Pi_t x$. The greedy control action according
to this is given by

(6) $u_t = \arg\min_u \left( c(x_t, u) + V_t(F x_t + G u) \right)$

    $= \arg\min_u \left( u^T R u + (F x_t + G u)^T \Pi_t (F x_t + G u) \right)$

    $= -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F) x_t.$
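
The closed form in the last line of Eq. (6) can be checked numerically. The following is a minimal sketch under assumed toy values; `greedy_gain` and `one_step_objective` are illustrative names, not taken from the report.

```python
import numpy as np

def greedy_gain(F, G, R, Pi):
    """Return L such that u = L @ x is the greedy action of Eq. (6)."""
    return -np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)

def one_step_objective(x, u, F, G, R, Pi):
    """u^T R u + (Fx + Gu)^T Pi (Fx + Gu); the u-independent x^T Q x term is dropped."""
    x_next = F @ x + G @ u
    return float(u @ R @ u + x_next @ Pi @ x_next)

# Hypothetical toy values: n = 2 states, m = 1 control.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
R = np.array([[1.0]])
Pi = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -2.0])

u_greedy = greedy_gain(F, G, R, Pi) @ x
best = one_step_objective(x, u_greedy, F, G, R, Pi)

# Randomly perturbed controls should never do better than the greedy one.
rng = np.random.default_rng(0)
perturbed = [
    one_step_objective(x, u_greedy + rng.normal(size=1), F, G, R, Pi)
    for _ in range(100)
]

# First-order optimality condition: 2 R u + 2 G^T Pi (F x + G u) = 0.
gradient = 2 * R @ u_greedy + 2 * G.T @ Pi @ (F @ x + G @ u_greedy)
```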

The 1-step TD error is

(7) $\delta_t = \begin{cases} c_f(x_t) - V_t(x_t) & \text{if } t = t_{STOP}, \\ \left( c(x_t, u_t) + V_t(x_{t+1}) \right) - V_t(x_t) & \text{otherwise,} \end{cases}$

and the update rule for the parameter matrix $\Pi_t$ is

(8) $\Pi_{t+1} = \Pi_t + \alpha_t \delta_t \nabla_{\Pi_t} V_t(x_t)$

    $= \Pi_t + \alpha_t \cdot \delta_t \cdot x_t x_t^T,$

where $\alpha_t$ is the learning rate.

The algorithm is summarized in Fig. 1.

    Initialize $x_0$, $u_0$, $\Pi_0$
    repeat
        $x_{t+1} = F x_t + G u_t$
        $\nu_{t+1}$ := random noise
        $u_{t+1} = -(R + G^T \Pi_{t+1} G)^{-1} (G^T \Pi_{t+1} F) x_{t+1} + \nu_{t+1}$
        with probability $p$,
            $\delta_t = x_t^T Q^f x_t - x_t^T \Pi_t x_t$
            STOP
        else
            $\delta_t = u_t^T R u_t + x_{t+1}^T \Pi_t x_{t+1} - x_t^T \Pi_t x_t$
        $\Pi_{t+1} = \Pi_t + \alpha_t \delta_t x_t x_t^T$
        $t = t + 1$
    end

Figure 1. TD(0) with linear function approximation for LQ control
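
The loop of Fig. 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the TD error uses the full cost of Eq. (2) as in Eq. (7), the parameter update is applied on the stopping step as well, the base rate $\alpha_0 / k$ and its cap by $1/\|x_t\|^4$ anticipate the learning-rate condition of Theorem 3.1, and all names and toy values (`td0_lq`, `alpha0`, `noise`, the matrices) are hypothetical.

```python
import numpy as np

def greedy_gain(F, G, R, Pi):
    """Greedy feedback gain of Eq. (6)."""
    return -np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)

def td0_lq(F, G, Q, R, Qf, p, Pi0, episodes=200, alpha0=0.05, noise=0.1, seed=0):
    """TD(0) with a quadratic value function V(x) = x^T Pi x (cf. Fig. 1)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    Pi = Pi0.copy()
    k = 0  # global step counter for the base rate alpha0 / k
    for _ in range(episodes):
        x = rng.normal(size=n)  # assumed random initial state
        u = greedy_gain(F, G, R, Pi) @ x + noise * rng.normal(size=m)
        while True:
            k += 1
            norm4 = np.linalg.norm(x) ** 4
            # Learning rate capped by 1 / ||x||^4.
            alpha = alpha0 / k if norm4 == 0 else min(alpha0 / k, 1.0 / norm4)
            stop = rng.random() < p
            if stop:
                delta = x @ Qf @ x - x @ Pi @ x
            else:
                x_next = F @ x + G @ u
                delta = x @ Q @ x + u @ R @ u + x_next @ Pi @ x_next - x @ Pi @ x
            Pi = Pi + alpha * delta * np.outer(x, x)  # Eq. (8)
            if stop:
                break
            x = x_next
            u = greedy_gain(F, G, R, Pi) @ x + noise * rng.normal(size=m)
    return Pi

# Hypothetical toy problem.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
Q, R, Qf, p = np.eye(2), np.eye(1), np.eye(2), 0.1
Pi_learned = td0_lq(F, G, Q, R, Qf, p, Pi0=10.0 * np.eye(2))
```

Because every update adds a multiple of $x_t x_t^T$, the estimate stays symmetric whenever $\Pi_0$ is symmetric.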

2.2. Sarsa. The cost-to-go function is used to select control actions, so the
action-value function $Q_t(x, u)$ is more appropriate for this purpose. The action-value
function is defined as

$Q_t^*(x, u) := \inf_{u_{t+1}, u_{t+2}, \ldots} E\left[ c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \ldots + c_f(x_f) \mid x_t = x, u_t = u \right],$

and analogously to $V_t^*$, it can be shown that it is time-independent and can be
written in the form

(9) $Q^*(x, u) = (x^T \; u^T) \begin{pmatrix} \Theta^*_{11} & \Theta^*_{12} \\ \Theta^*_{21} & \Theta^*_{22} \end{pmatrix} \begin{pmatrix} x \\ u \end{pmatrix} = (x^T \; u^T) \, \Theta^* \begin{pmatrix} x \\ u \end{pmatrix}.$

Note that $\Pi^*$ can be expressed by $\Theta^*$ using the relationship $V^*(x) = \min_u Q^*(x, u)$:

(10) $\Pi^* = \Theta^*_{11} - \Theta^*_{12} (\Theta^*_{22})^{-1} \Theta^*_{21}.$
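
Equation (10) is the Schur complement of the block $\Theta^*_{22}$. A small numerical check, using an arbitrary symmetric positive definite matrix in place of $\Theta^*$ (an assumption made only so that the minimum over $u$ exists), might look like this; `pi_from_theta` is an illustrative name.

```python
import numpy as np

def pi_from_theta(Theta, n):
    """Schur complement of the lower-right block, as in Eq. (10)."""
    T11, T12 = Theta[:n, :n], Theta[:n, n:]
    T21, T22 = Theta[n:, :n], Theta[n:, n:]
    return T11 - T12 @ np.linalg.solve(T22, T21)

rng = np.random.default_rng(1)
n, m = 2, 1
A = rng.normal(size=(n + m, n + m))
Theta = A @ A.T + (n + m) * np.eye(n + m)  # toy symmetric positive definite matrix

x = rng.normal(size=n)
# Minimizer of Q(x, u) over u for symmetric Theta.
u_min = -np.linalg.solve(Theta[n:, n:], Theta[n:, :n] @ x)
z = np.concatenate([x, u_min])
q_min = z @ Theta @ z                    # min_u Q(x, u)
v = x @ pi_from_theta(Theta, n) @ x      # x^T Pi x

# Any other control gives a value at least as large.
z_other = np.concatenate([x, u_min + 0.3])
q_other = z_other @ Theta @ z_other
```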

If the $t$th estimate of $Q^*$ is $Q_t(x, u) = (x^T \; u^T) \, \Theta_t \, (x^T \; u^T)^T$, then the greedy control
action is given as

(11) $u_t = \arg\min_u Q_t(x_t, u) = -\Theta_{22}^{-1} \frac{\Theta_{12}^T + \Theta_{21}}{2} x_t = -\Theta_{22}^{-1} \Theta_{21} x_t,$

where subscript $t$ of $\Theta$ has been omitted to improve readability.

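The estimation error and the weight update are similar to the state-value case:

(12) $\delta_t = \begin{cases} c_f(x_t) - Q_t(x_t, u_t) & \text{if } t = t_{STOP}, \\ \left( c(x_t, u_t) + Q_t(x_{t+1}, u_{t+1}) \right) - Q_t(x_t, u_t) & \text{otherwise,} \end{cases}$

(13) $\Theta_{t+1} = \Theta_t + \alpha_t \delta_t \nabla_{\Theta_t} Q_t(x_t, u_t)$

     $= \Theta_t + \alpha_t \cdot \delta_t \cdot \begin{pmatrix} x_t \\ u_t \end{pmatrix} \begin{pmatrix} x_t \\ u_t \end{pmatrix}^T.$

The algorithm is summarized in Fig. 2.

    Initialize $x_0$, $u_0$, $\Theta_0$
    $z_0 = (x_0^T \; u_0^T)^T$
    repeat
        $x_{t+1} = F x_t + G u_t$
        $\nu_{t+1}$ := random noise
        $u_{t+1} = -(\Theta_t)_{22}^{-1} (\Theta_t)_{21} x_{t+1} + \nu_{t+1}$
        $z_{t+1} = (x_{t+1}^T \; u_{t+1}^T)^T$
        with probability $p$,
            $\delta_t = x_t^T Q^f x_t - z_t^T \Theta_t z_t$
            STOP
        else
            $\delta_t = u_t^T R u_t + z_{t+1}^T \Theta_t z_{t+1} - z_t^T \Theta_t z_t$
        $\Theta_{t+1} = \Theta_t + \alpha_t \delta_t z_t z_t^T$
        $t = t + 1$
    end

Figure 2. Sarsa(0) with linear function approximation for LQ control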
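
A compact Python sketch of the Sarsa(0) loop of Fig. 2 is given below. Several choices are assumptions made for the illustration and are not stated in the report: the TD error uses the full cost of Eq. (2) as in Eq. (12), the update is also applied on the stopping step, the learning rate is capped by $1/\|z_t\|^4$ (with $z_t$ in place of $x_t$), and the names and toy values (`sarsa0_lq`, `alpha0`, `noise`, the matrices) are hypothetical.

```python
import numpy as np

def sarsa0_lq(F, G, Q, R, Qf, p, Theta0, episodes=200, alpha0=0.05, noise=0.1, seed=0):
    """Sarsa(0) with a quadratic action-value function Q(x,u) = z^T Theta z (cf. Fig. 2)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    Theta = Theta0.copy()

    def act(Theta, x):
        # Greedy action of Eq. (11) plus exploration noise.
        greedy = -np.linalg.solve(Theta[n:, n:], Theta[n:, :n] @ x)
        return greedy + noise * rng.normal(size=m)

    k = 0  # global step counter for the base rate alpha0 / k
    for _ in range(episodes):
        x = rng.normal(size=n)  # assumed random initial state
        u = act(Theta, x)
        while True:
            k += 1
            z = np.concatenate([x, u])
            norm4 = np.linalg.norm(z) ** 4
            alpha = alpha0 / k if norm4 == 0 else min(alpha0 / k, 1.0 / norm4)
            stop = rng.random() < p
            if stop:
                delta = x @ Qf @ x - z @ Theta @ z
            else:
                x_next = F @ x + G @ u
                u_next = act(Theta, x_next)
                z_next = np.concatenate([x_next, u_next])
                delta = x @ Q @ x + u @ R @ u + z_next @ Theta @ z_next - z @ Theta @ z
            Theta = Theta + alpha * delta * np.outer(z, z)  # Eq. (13)
            if stop:
                break
            x, u = x_next, u_next
    return Theta

# Hypothetical toy problem.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
Q, R, Qf, p = np.eye(2), np.eye(1), np.eye(2), 0.1
Theta_learned = sarsa0_lq(F, G, Q, R, Qf, p, Theta0=5.0 * np.eye(3))
```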

3. Convergence 

Theorem 3.1. If $\Pi_0 > \Pi^*$ and there exists an $L$ such that $\|F + GL\| < 1/\sqrt{1 - p}$,
then there exists a series of learning rates $\alpha_t$ such that $0 < \alpha_t \le 1/\|x_t\|^4$,
$\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, and it can be computed online. For all sequences of
learning rates satisfying these requirements, the algorithm of Fig. 1 converges to the optimal policy.

The proof of the theorem can be found in Appendix B.

The same line of thought can be carried over to the action-value function
$Q(x, u) = (x^T \; u^T) \, \Theta \, (x^T \; u^T)^T$, which we do not detail here, giving only the result:

Theorem 3.2. If $\Theta_0 > \Theta^*$ and there exists an $L$ such that $\|F + GL\| < 1/\sqrt{1 - p}$,
then there exists a series of learning rates $\alpha_t$ such that $0 < \alpha_t \le 1/\|x_t\|^4$,
$\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, and it can be computed online. For all sequences of
learning rates satisfying these requirements, Sarsa(0) with LFA (Fig. 2) converges to the optimal
policy.

4. Kalman filter LQ control 

Now let us examine the case when we do not know the exact states, but we have
to estimate them from noisy observations. Consider a linear dynamical system with
state $x_t \in \mathbb{R}^n$, control $u_t \in \mathbb{R}^m$, observation $y_t \in \mathbb{R}^k$, and noises $\xi_t \in \mathbb{R}^n$ and $\zeta_t \in \mathbb{R}^k$
(which are assumed to be uncorrelated Gaussians with covariance matrices $\Omega^\xi$ and
$\Omega^\zeta$, respectively), in discrete time $t$:

(14) $x_{t+1} = F x_t + G u_t + \xi_t$

(15) $y_t = H x_t + \zeta_t.$

Assume that the initial state has mean $\hat{x}_1$ and covariance $\Sigma_1$. Furthermore, assume
that executing control step $u_t$ in $x_t$ costs

(16) $c(x_t, u_t) := x_t^T Q x_t + u_t^T R u_t.$

After each time step, the process will be stopped with some fixed probability $p$, and
then the controller incurs the final cost $c_f(x_f) := x_f^T Q^f x_f$.

We will show that the separation principle holds for our problem, i.e. the control 
law and the state filtering can be computed independently from each other. On one 
hand, state estimation is independent of the control selection method (in fact, the 
control could be anything, because it does not affect the estimation error), i.e. we 
can estimate the state of the system by the standard Kalman filtering equations: 

(17) $\hat{x}_{t+1} = F \hat{x}_t + G u_t + K_t (y_t - H \hat{x}_t)$

(18) $K_t = F \Sigma_t H^T (H \Sigma_t H^T + \Omega^\zeta)^{-1}$

(19) $\Sigma_{t+1} = \Omega^\xi + F \Sigma_t F^T - K_t H \Sigma_t F^T.$
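
One step of the filter in Eqs. (17) to (19) can be written as a short function. The sketch below uses assumed names (`kalman_step`, `Om_xi`, `Om_zeta`) and a toy system; it is meant only to make the order of the computations explicit.

```python
import numpy as np

def kalman_step(x_hat, Sigma, u, y, F, G, H, Om_xi, Om_zeta):
    """One step of Eqs. (17)-(19): returns the next estimate, covariance, and the gain."""
    S = H @ Sigma @ H.T + Om_zeta
    # K = F Sigma H^T S^{-1}; the transpose trick is valid because S and Sigma are symmetric.
    K = np.linalg.solve(S, H @ Sigma @ F.T).T
    x_hat_next = F @ x_hat + G @ u + K @ (y - H @ x_hat)
    Sigma_next = Om_xi + F @ Sigma @ F.T - K @ H @ Sigma @ F.T
    return x_hat_next, Sigma_next, K

# Hypothetical toy system: only the first state component is observed.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])
Om_xi = 0.01 * np.eye(2)
Om_zeta = np.array([[0.1]])

rng = np.random.default_rng(3)
x = np.array([1.0, 0.0])
x_hat, Sigma = np.zeros(2), np.eye(2)
for _ in range(50):
    u = np.zeros(1)  # the control does not affect the estimation error
    y = H @ x + rng.normal(scale=np.sqrt(0.1), size=1)
    x_hat, Sigma, K = kalman_step(x_hat, Sigma, u, y, F, G, H, Om_xi, Om_zeta)
    x = F @ x + G @ u + rng.normal(scale=0.1, size=2)
```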

On the other hand, it is easy to show that the optimal control can be expressed
as a function of $\hat{x}_t$. The proof (similarly to the proof of the original separation
principle) is based on the fact that the noise and error terms appearing in the
expressions are either linear and have zero mean, or quadratic and independent of
$u$. In both cases they can be omitted. More precisely, let $W_t$ denote the sequence
$y_1, \ldots, y_t, u_1, \ldots, u_{t-1}$, and let $e_t = x_t - \hat{x}_t$. Equation (6) for the filtered case can
be formulated as

(20) $u_t = \arg\min_u E\left( c(x_t, u) + V_t(F x_t + G u + \xi_t) \mid W_t \right)$

     $= \arg\min_u E\left( x_t^T Q x_t + u^T R u + (F x_t + G u + \xi_t)^T \Pi_t (F x_t + G u + \xi_t) \mid W_t \right).$

Using the fact that $E(x_t^T Q x_t \mid W_t)$ and $E(\xi_t^T \Pi_t \xi_t \mid W_t)$ are independent of $u$ and that
$E((F x_t + G u)^T \Pi_t \xi_t \mid W_t) = 0$, furthermore that $x_t = \hat{x}_t + e_t$, we get

$u_t = \arg\min_u E\left( u^T R u + (F \hat{x}_t + F e_t + G u)^T \Pi_t (F \hat{x}_t + F e_t + G u) \mid W_t \right).$

Finally, we know that $E(e_t \mid W_t) = 0$, because the Kalman filter is an unbiased
estimator; furthermore, $E(e_t^T \Pi_t e_t \mid W_t)$ is independent of $u$, which yields

$u_t = \arg\min_u E\left( u^T R u + (F \hat{x}_t + G u)^T \Pi_t (F \hat{x}_t + G u) \mid W_t \right)$

$= -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F) \hat{x}_t,$

i.e. for the computation of the greedy control action according to $V_t$ we can use
the estimated state instead of the exact one. The proof of the separation principle
for Sarsa(0) is quite similar and is therefore omitted here.

The resulting algorithm using TD(0) is summarized in Fig. 3. The algorithm
using Sarsa can be derived in a similar manner.

    Initialize $x_0$, $\hat{x}_0$, $u_0$, $\Pi_0$, $\Sigma_0$
    repeat
        $x_{t+1} = F x_t + G u_t + \xi_t$
        $y_t = H x_t + \zeta_t$
        $K_t = F \Sigma_t H^T (H \Sigma_t H^T + \Omega^\zeta)^{-1}$
        $\Sigma_{t+1} = \Omega^\xi + F \Sigma_t F^T - K_t H \Sigma_t F^T$
        $\hat{x}_{t+1} = F \hat{x}_t + G u_t + K_t (y_t - H \hat{x}_t)$
        $\nu_{t+1}$ := random noise
        $u_{t+1} = -(R + G^T \Pi_{t+1} G)^{-1} (G^T \Pi_{t+1} F) \hat{x}_{t+1} + \nu_{t+1}$
        with probability $p$,
            $\delta_t = \hat{x}_t^T Q^f \hat{x}_t - \hat{x}_t^T \Pi_t \hat{x}_t$
            STOP
        else
            $\delta_t = u_t^T R u_t + \hat{x}_{t+1}^T \Pi_t \hat{x}_{t+1} - \hat{x}_t^T \Pi_t \hat{x}_t$
        $\Pi_{t+1} = \Pi_t + \alpha_t \delta_t \hat{x}_t \hat{x}_t^T$
        $t = t + 1$
    end

Figure 3. Kalman filtering with TD control

Appendix A. The boundedness of $\|x_t\|$

We need several technical lemmas to show that $\|x_t\|$ remains bounded for the
linear-quadratic case, and also that $E(\|x_t\|)$ remains bounded for the Kalman filter
case. The latter result implies that for the KF case, $\|x_t\|$ remains bounded with
high probability.

For any positive semidefinite matrix $\Pi$ and any state $x$, we can define the action
vector which minimizes the one-step-ahead value function:

$u_{greedy} := \arg\min_u \left( u^T R u + (Fx + Gu)^T \Pi (Fx + Gu) \right)$

$= -(R + G^T \Pi G)^{-1} (G^T \Pi F) x.$

Let

$L_\Pi := -(R + G^T \Pi G)^{-1} (G^T \Pi F)$

denote the greedy control for matrix $\Pi$, and let

$L^* := -(R + G^T \Pi^* G)^{-1} (G^T \Pi^* F)$

be the optimal policy; furthermore, let $q := 1/\sqrt{1 - p}$.

Lemma A.1. If there exists an $L$ such that $\|F + GL\| < q$, then $\|F + GL^*\| < q$
as well.

Proof. Indirectly, suppose that $\|F + GL^*\| > q$. Then for a fixed $x_0$, let $x_t$ be the
optimal trajectory

$x_{t+1} = (F + GL^*) x_t.$

Then

$V^*(x_0) = p \, c_f(x_0) + (1 - p) c(x_0, L^* x_0)$
$\quad + (1 - p) p \, c_f(x_1) + (1 - p)^2 c(x_1, L^* x_1)$
$\quad + (1 - p)^2 p \, c_f(x_2) + (1 - p)^3 c(x_2, L^* x_2)$
$\quad + \ldots$

$V^*(x_0) \ge p \left( c_f(x_0) + (1 - p) c_f(x_1) + (1 - p)^2 c_f(x_2) + \ldots \right)$
$= p \sum_k (1 - p)^k x_0^T \left( (F + GL^*)^k \right)^T Q^f (F + GL^*)^k x_0.$

We know that $Q^f$ is positive definite, so there exists an $\varepsilon$ such that $x^T Q^f x \ge \varepsilon \|x\|^2$,
therefore

$V^*(x_0) \ge \varepsilon p \sum_k (1 - p)^k \|(F + GL^*)^k x_0\|^2.$

If $x_0$ is the eigenvector corresponding to the maximal eigenvalue of $F + GL^*$, then
$(F + GL^*) x_0 = \|F + GL^*\| x_0$, and so $(F + GL^*)^k x_0 = \|F + GL^*\|^k x_0$.
Consequently,

$V^*(x_0) \ge \varepsilon p \sum_k (1 - p)^k \|F + GL^*\|^{2k} \|x_0\|^2.$

Under the indirect assumption, $(1 - p) \|F + GL^*\|^2 > 1$, so this series diverges.
On the other hand, because of $\|F + GL\| < q$, it is easy to see that the value of
following the control law $L$ from $x_0$ is finite, therefore we get $V^L(x_0) < V^*(x_0)$,
which is a contradiction. □

Lemma A.2. For positive definite matrices $A$ and $B$, if $A > B$ then $\|A^{-1} B\| < 1$.

Proof. Indirectly, suppose that $\|A^{-1} B\| > 1$. Let $\lambda_{max}$ be the maximal eigenvalue
of $A^{-1} B$, and $v$ be a corresponding eigenvector:

$A^{-1} B v = \lambda_{max} v,$

and according to the indirect assumption,

$\lambda_{max} = \|A^{-1} B\| > 1.$

$A > B$ means that for each $x$, $x^T A x > x^T B x$, so this holds specifically for
$x = A^{-1} B v = \lambda_{max} v$, too. So, on one hand,

$(\lambda_{max} v)^T B (\lambda_{max} v) = \lambda_{max}^2 v^T B v > v^T B v,$

and on the other hand,

$(\lambda_{max} v)^T A (\lambda_{max} v) = (A^{-1} B v)^T A (A^{-1} B v) = v^T (B A^{-1} B) v,$

so

$v^T (B A^{-1} B) v > v^T B v.$

However, from $A > B$, $A^{-1} < B^{-1}$. Multiplying this with $B$ from both sides,
we get $B A^{-1} B < B$, which is a contradiction. □

Lemma A.3. If there exists an $L$ such that $\|F + GL\| < q$, then for any $\Pi$ such
that $\Pi > \Pi^*$, $\|F + GL_\Pi\| < q$, too.

Proof. We will apply the Woodbury identity [6], stating that for positive definite
matrices $R$ and $\Pi$,

$(R + G^T \Pi G)^{-1} G^T \Pi = R^{-1} G^T (G R^{-1} G^T + \Pi^{-1})^{-1}.$

Consequently,

$F + GL_\Pi = F - G (R + G^T \Pi G)^{-1} (G^T \Pi F)$

$= F - (G R^{-1} G^T) (G R^{-1} G^T + \Pi^{-1})^{-1} F.$
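
The identity and the resulting expression for the closed-loop matrix can be sanity-checked numerically. The snippet below is an illustrative check on randomly generated positive definite matrices (the helper `random_spd` and the dimensions are assumptions), not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2

def random_spd(k):
    """Random symmetric positive definite k x k matrix (toy helper)."""
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

R, Pi = random_spd(m), random_spd(n)
G = rng.normal(size=(n, m))
F = rng.normal(size=(n, n))

# Left- and right-hand side of the Woodbury-type identity.
lhs = np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi)
M = G @ np.linalg.solve(R, G.T)  # G R^{-1} G^T
rhs = np.linalg.solve(R, G.T) @ np.linalg.inv(M + np.linalg.inv(Pi))

# Closed-loop matrix F + G L_Pi, computed in the two equivalent ways.
closed_loop = F - G @ lhs @ F
closed_loop_alt = F - M @ np.linalg.inv(M + np.linalg.inv(Pi)) @ F
```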

Let

$U_\Pi := I - (G R^{-1} G^T) (G R^{-1} G^T + \Pi^{-1})^{-1}$

$= \Pi^{-1} (G R^{-1} G^T + \Pi^{-1})^{-1}$

and

$U^* := I - (G R^{-1} G^T) (G R^{-1} G^T + (\Pi^*)^{-1})^{-1}$

$= (\Pi^*)^{-1} (G R^{-1} G^T + (\Pi^*)^{-1})^{-1}.$

Both matrices are positive definite, because they are the product of positive definite
matrices. With these notations, $F + GL_\Pi = U_\Pi F$ and $F + GL^* = U^* F$.

It is easy to show that $U_\Pi < U^*$, exploiting the fact that $\Pi > \Pi^*$ and several
well-known properties of matrix inequalities: if $A > B$ and $C$ is positive semidefinite,
then $-A < -B$, $A^{-1} < B^{-1}$, $A + C > B + C$, $A \cdot C > B \cdot C$.

From Lemma A.1 we know that $\|U^* F\| = \|F + GL^*\| < q$, and from the previous
lemma we know that $\|U_\Pi (U^*)^{-1}\| < 1$, so

$\|F + GL_\Pi\| = \|U_\Pi F\| = \|U_\Pi (U^*)^{-1} U^* F\| \le \|U_\Pi (U^*)^{-1}\| \, \|U^* F\| < 1 \cdot q.$

□

Corollary A.4. If there exists an $L$ such that $\|F + GL\| < q$, then the state
sequence generated by the noise-free LQ equations is bounded, i.e., there exists $M \in \mathbb{R}$
such that $\|x_t\| < M$.

Proof. This is a simple corollary of the previous lemma: in each step we use a
greedy control law $L_t$, so

$\|x_{t+1}\| = \|(F + GL_t) x_t\| \le q \|x_t\|.$

□

Corollary A.5. If there exists an $L$ such that $\|F + GL\| < q$, then the state
sequence generated by the Kalman-filter equations is bounded with high probability,
i.e., for any $\varepsilon > 0$, there exists $M \in \mathbb{R}$ such that $\|x_t\| < M$ with probability $1 - \varepsilon$.

Proof.

$E\|x_{t+1}\| = E\|(F + GL_t) x_t + \xi_t\| \le E\|(F + GL_t) x_t\| + \Omega^\xi$

$\le q E\|x_t\| + \Omega^\xi,$

so there exists a bound $M'$ such that $E\|x_t\| \le M'$. From Markov's inequality,

$\Pr(\|x_t\| > M'/\varepsilon) < \varepsilon,$

therefore, $M = M'/\varepsilon$ satisfies our requirements. □

Appendix B. The proof of the main theorem 
We will use the following lemma: 

Lemma B.1. Let $J$ be a differentiable function, bounded from below by $J^*$, and let
$\nabla J$ be Lipschitz-continuous. Suppose the weight sequence $w_t$ satisfies

$w_{t+1} = w_t + \alpha_t b_t$

for random vectors $b_t$ independent of $w_{t+1}, w_{t+2}, \ldots$, and $b_t$ is a descent direction
for $J$, i.e. $E(b_t \mid w_t)^T \nabla J(w_t) \le -\delta(\varepsilon) < 0$ whenever $J(w_t) > J^* + \varepsilon$. Suppose also
that

$E(\|b_t\|^2 \mid w_t) \le K_1 J(w_t) + K_2 E(b_t \mid w_t)^T \nabla J(w_t) + K_3$

and finally that the constants $\alpha_t$ satisfy $\alpha_t > 0$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$. Then
$J(w_t) \to J^*$ with probability 1.

In our case, the weight vectors are $n \times n$ dimensional, with $w_{n \cdot i + j} := \Pi_{ij}$.
For the sake of simplicity, we denote this by $w_{(ij)}$. Let $w^*$ be the weight vector
corresponding to the optimal value function, and let

$J(w) = \frac{1}{2} \|w - w^*\|^2.$

Theorem B.2 (Theorem 3.1). If $\Pi_0 > \Pi^*$ and there exists an $L$ such that $\|F + GL\| < q$,
then there exists a series of learning rates $\alpha_t$ such that $0 < \alpha_t \le 1/\|x_t\|^4$,
$\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, and it can be computed online. For all sequences of
learning rates satisfying these requirements, the algorithm of Fig. 1 converges to the optimal
policy.

Proof. First of all, we prove the existence of a suitable learning rate sequence. Let
$\alpha'_t$ be a sequence of learning rates that satisfies two of the requirements, $\sum_t \alpha'_t = \infty$
and $\sum_t (\alpha'_t)^2 < \infty$. Fix a probability $0 < \varepsilon < 1$. By Corollary A.5, there exists
a bound $M$ such that $\|x_t\| < M$ with probability $1 - \varepsilon$. The learning rates

$\alpha_t := \min\{\alpha'_t, 1/\|x_t\|^4\}$

will be satisfactory, and can be computed on the fly. The first and third requirements
are trivially satisfied, so we only have to show that $\sum_t \alpha_t = \infty$. Consider
the index set $H = \{t : \alpha'_t < 1/M^4\} \cup \{t : \alpha'_t < 1/\|x_t\|^4\}$. By the first condition
only finitely many indices are excluded. The second condition excludes indices with
$1/M^4 < \alpha'_t < 1/\|x_t\|^4$, which happens at most with probability $\varepsilon$. However,

$\sum_t \alpha_t \ge \sum_{t \in H} \alpha_t = \sum_{t \in H} \alpha'_t = \infty.$

The last equality holds, because if we take a divergent sum of nonnegative terms,
and exclude finitely many terms or an index set with density less than 1, then the
remaining subseries will remain divergent.

An update step of the algorithm is $\alpha_t \delta_t x_t x_t^T$. To make the proof simpler, we
decompose this into a step size $\alpha'_t$ and a direction vector $(\alpha_t / \alpha'_t) \delta_t x_t x_t^T$. Denote
the scaling factor by

$\Lambda_t := \alpha_t / \alpha'_t = \min\{1, 1/(\alpha'_t \|x_t\|^4)\}.$

Clearly, $\Lambda_t \le 1$. In fact, it will be one most of the time, and will damp only the
samples that are too big.
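
The capped learning rate $\alpha_t = \min\{\alpha'_t, 1/\|x_t\|^4\}$ and the scaling factor $\Lambda_t$ are straightforward to compute on the fly. A minimal sketch, with `capped_rate` and `damping` as assumed helper names and with the convention that no cap applies when $x_t = 0$:

```python
import numpy as np

def capped_rate(alpha_prime, x):
    """alpha_t = min(alpha'_t, 1 / ||x_t||^4); no cap is applied when x_t = 0."""
    norm4 = np.linalg.norm(x) ** 4
    return alpha_prime if norm4 == 0 else min(alpha_prime, 1.0 / norm4)

def damping(alpha_prime, x):
    """Lambda_t = alpha_t / alpha'_t, which never exceeds 1."""
    return capped_rate(alpha_prime, x) / alpha_prime

# A large state is damped, a small one is not.
alpha_large = capped_rate(0.5, np.array([2.0, 0.0]))   # 1 / 2^4 caps the rate
lambda_large = damping(0.5, np.array([2.0, 0.0]))
alpha_small = capped_rate(0.5, np.array([0.5, 0.0]))   # cap 1 / 0.5^4 is inactive
lambda_small = damping(0.5, np.array([0.5, 0.0]))
```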
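We will show that

$b_t = \Lambda_t \delta_t x_t x_t^T$

is a descent direction for every $t$. Treating $x_t x_t^T$ as a vector in weight space,

$E(b_t \mid w_t)^T \nabla J(w_t) = \Lambda_t E(\delta_t \mid w_t) \, (x_t x_t^T)^T (w_t - w^*)$

$= \Lambda_t E(\delta_t \mid w_t) \, x_t^T (\Pi_t - \Pi^*) x_t$

$= \Lambda_t E(\delta_t \mid w_t) \, (V_t(x_t) - V^*(x_t)).$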
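
The step from the weight-space inner product to the quadratic form $x_t^T (\Pi_t - \Pi^*) x_t$ relies only on the flattening $w_{n \cdot i + j} = \Pi_{ij}$. A quick numerical illustration, with arbitrary toy matrices standing in for $\Pi_t$ and $\Pi^*$ (these values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3

A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
Pi_t = A @ A.T + np.eye(n)   # toy stand-in for the current estimate
Pi_opt = B @ B.T             # toy stand-in for Pi^*
x = rng.normal(size=n)

# Row-major flattening realizes w[n * i + j] = Pi[i, j] (zero-based indices).
w = Pi_t.reshape(-1)
w_opt = Pi_opt.reshape(-1)

grad_J = w - w_opt                      # gradient of J(w) = 0.5 * ||w - w^*||^2
direction = np.outer(x, x).reshape(-1)  # x x^T as a weight-space vector

inner_product = direction @ grad_J
quadratic_form = x @ (Pi_t - Pi_opt) @ x
```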

For the sake of simplicity, from now on we do not note the dependence on $w_t$
explicitly.

We will show that for all $t$, $E(\Pi_t) \ge \Pi^*$, $E(\Pi_{t-1}) \ge E(\Pi_t)$ and
$E(\delta_t) \le -p \, x_t^T (\Pi_t - \Pi^*) x_t$. We proceed by induction.

• $t = 0$. $\Pi_0 > \Pi^*$ holds by assumption.

• Induction step part 1: $E(\delta_t) \le -p \, x_t^T (\Pi_t - \Pi^*) x_t$.
Recall that

(21) $u_t = \arg\min_u \left( c(x_t, u) + V_t(F x_t + G u) \right) = L_t x_t,$

where

$L_t = -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F)$

is the greedy control law with respect to $V_t$. Clearly, by the definition of $L_t$,

$c(x_t, L_t x_t) + V_t(F x_t + G L_t x_t) \le c(x_t, L^* x_t) + V_t(F x_t + G L^* x_t).$

This yields

(22) $E(\delta_t) = p \, c_f(x_t) + (1 - p) \left( c(x_t, L_t x_t) + V_t(F x_t + G L_t x_t) \right) - V_t(x_t)$

     $\le p \, c_f(x_t) + (1 - p) \left( c(x_t, L^* x_t) + V_t(F x_t + G L^* x_t) \right) - V_t(x_t).$

We know that the optimal value function satisfies the fixed-point equation

(23) $0 = \left( p \, c_f(x_t) + (1 - p) \left( c(x_t, L^* x_t) + V^*(F x_t + G L^* x_t) \right) \right) - V^*(x_t).$

Subtracting this from Eq. (22), we get

(24) $E(\delta_t) \le (1 - p) \left( V_t(F x_t + G L^* x_t) - V^*(F x_t + G L^* x_t) \right) - \left( V_t(x_t) - V^*(x_t) \right)$

(25) $= (1 - p) \, x_t^T (F + G L^*)^T (\Pi_t - \Pi^*) (F + G L^*) x_t$

(26) $\quad - x_t^T (\Pi_t - \Pi^*) x_t.$

Let $\varepsilon_1 = \varepsilon_1(p) := 1/(1 - p) - \|F + G L^*\|^2 > 0$. Inequality (24) implies

(27) $E(\delta_t) \le (1 - p) \left( \frac{1}{1 - p} - \varepsilon_1(p) \right) x_t^T (\Pi_t - \Pi^*) x_t - x_t^T (\Pi_t - \Pi^*) x_t$

(28) $= -(1 - p) \, \varepsilon_1(p) \, x_t^T (\Pi_t - \Pi^*) x_t$

(29) $= -\varepsilon_2(p) \, x_t^T (\Pi_t - \Pi^*) x_t,$

where we defined $\varepsilon_2(p) = (1 - p) \varepsilon_1(p)$.

• Induction step part 2: $E(\Pi_{t+1}) \ge \Pi^*$.

(30) $E(\delta_t) = p \, c_f(x_t) + (1 - p) \left( c(x_t, L_t x_t) + V_t(F x_t + G L_t x_t) \right) - V_t(x_t)$

     $\ge p \, c_f(x_t) + (1 - p) \left( c(x_t, L_t x_t) + V^*(F x_t + G L_t x_t) \right) - V_t(x_t).$

Subtracting Eq. (23), we get

(31) $E(\delta_t) \ge (1 - p) \left( \left( c(x_t, L_t x_t) + V^*(F x_t + G L_t x_t) \right) - \left( c(x_t, L^* x_t) + V^*(F x_t + G L^* x_t) \right) \right) + V^*(x_t) - V_t(x_t)$

     $\ge V^*(x_t) - V_t(x_t) \ge -\|\Pi_t - \Pi^*\| \, \|x_t\|^2.$

Therefore

(32) $E(\Pi_{t+1}) - \Pi^* = \Pi_t + \alpha'_t \Lambda_t E(\delta_t) x_t x_t^T - \Pi^*$

(33) $\ge (\Pi_t - \Pi^*) - \alpha_t \|x_t\|^4 \, \|\Pi_t - \Pi^*\| \, I$

(34) $\ge (\Pi_t - \Pi^*) - \|\Pi_t - \Pi^*\| \, I \ge 0.$

• Induction step part 3: $\Pi_t \ge E(\Pi_{t+1})$.

(35) $\Pi_t - E(\Pi_{t+1}) = -\alpha_t E(\delta_t) x_t x_t^T \ge \alpha_t \varepsilon_2(p) \, x_t^T (\Pi_t - \Pi^*) x_t \cdot x_t x_t^T,$

but $\alpha_t \varepsilon_2(p) > 0$, $x_t^T (\Pi_t - \Pi^*) x_t > 0$ and $x_t x_t^T \ge 0$, so their product is positive
semidefinite as well.

The induction is therefore complete.

We finish the proof by showing that the assumptions of Lemma B.1 hold:

$b_t$ is a descent direction. Clearly, if $J(w_t) > \varepsilon$, then $\|\Pi_t - \Pi^*\| > \varepsilon_3(\varepsilon)$, but
$\Pi_t - \Pi^*$ is positive definite, so $\Pi_t - \Pi^* \ge \varepsilon_3(\varepsilon) I$.

$E(b_t \mid w_t)^T \nabla J(w_t) = \Lambda_t E(\delta_t \mid w_t) (V_t(x_t) - V^*(x_t))$

$\le -\varepsilon_2(p) \Lambda_t \, x_t^T (\Pi_t - \Pi^*) x_t \cdot x_t^T (\Pi_t - \Pi^*) x_t$

$\le -\varepsilon_2 \varepsilon_3^2 \Lambda_t \|x_t\|^4$

$\le -\varepsilon_2 \varepsilon_3^2 \min\{\|x_t\|^4, 1/\alpha'_t\}.$

$E(\|b_t\|^2 \mid w_t)$ is bounded. $|E(\delta_t)| \le |x_t^T (\Pi_t - \Pi^*) x_t|$. Therefore

$E(\|b_t\|^2 \mid w_t) \le |E(\delta_t)|^2 \Lambda_t^2 \|x_t\|^2$

$\le \|\Pi_t - \Pi^*\|^2 \cdot \min\{1, 1/(\alpha'^2_t \|x_t\|^8)\} \cdot \|x_t\|^6$

$\le \|\Pi_t - \Pi^*\|^2 \cdot \min\{\|x_t\|^6, 1/(\alpha'^2_t \|x_t\|^2)\}$

$\le K \cdot J(w_t).$

Consequently, the assumptions of Lemma B.1 hold, so the algorithm converges
to the optimal value function with probability 1. □